Abstract. Eriksson showed that the singular value decomposition (SVD) of
flattenings determines whether a partition of the leaves of a phylogenetic tree
is a split ([?]). In this paper, based on his work, we develop new statistically
consistent algorithms, suited to grid computing, that construct a phylogenetic
tree by computing SVDs of flattenings with a small, fixed number of rows.



1. Introduction 

Phylogenetic analysis of a family of related nucleic acid or protein sequences
aims to determine how the family could have been derived during evolution. Assume
that evolution follows a tree model, with evolution acting independently at
different sites of the genome. We take the transition matrices of this model to
follow the general Markov model, which is more general than any other model in
the Felsenstein hierarchy. Reconstructing evolutionary trees is one of the main
goals of phylogenetics.

Since statistical models are algebraic varieties, we are interested in the
polynomials that define these varieties, called phylogenetic invariants. Many
authors have studied phylogenetic invariants for different models ([1], [3], [9],
[13], [15]). Phylogenetic invariants have also been used for phylogenetic tree
reconstruction ([5]).

Procedures for phylogenetic analysis are linked to those for sequence alignment.
A group of similar sequences with little variation is easily organized into a
phylogenetic tree. On the other hand, as sequences diverge through evolutionary
change, they become more difficult to align. A phylogenetic analysis of very
different sequences is also hard, since there are many possible evolutionary
paths that could have produced the observed sequence variation. Many phylogenetic
analysis programs have been developed to address these difficulties. The main
ones in use are PHYLIP (phylogenetic inference package, [8]), available from
J. Felsenstein, and PAUP ([17]), available from Sinauer Associates, Sunderland,
Massachusetts. These programs currently provide three methods for phylogenetic
analysis (parsimony, distance, and maximum likelihood) and also offer many
evolutionary models for sequence variation.

Splits in a phylogenetic tree play an important role in reconstructing the tree
([14]). Eriksson proposed a tree-building algorithm using the SVD of flattenings
in Chapter 19 ([7]) of [10]. There he built a phylogenetic tree without using any
notion of distance, concentrating instead on the phylogenetic invariants given by
rank conditions on flattenings. However, his method has difficulty with trees
having a large number of leaves, since it must compute SVDs of flattenings of
huge size. In this paper, we construct algorithms that use SVDs of flattenings
with a fixed number of rows, namely 16. We present tree-building algorithms
(Algorithm 1 and Algorithm 2) in Sections 3 and 4; they use the SVD to measure
how close a matrix is to having a certain rank. In Section 5, we use the program
seq-gen ([12]) to simulate data of various lengths on phylogenetic trees, and then
build trees using Algorithm 1, Algorithm 2 and the neighbor joining algorithm
(NJ). It turns out that, with respect to numerical stability, our algorithms are
effective for constructing phylogenetic trees involving n > 15 species from DNA
sequences. We also compare our algorithms to NJ using simulated data and real
ENCODE data. Our algorithms are suitable for constructing phylogenetic trees
under general Markov models, i.e. models coming from real data.

2. Notations and Preliminaries

In this section we recall known results of Eriksson and basic concepts from the
book Algebraic Statistics for Computational Biology ([10]) for later use. We also
state a basic theorem which plays an important role in this paper.

A phylogenetic X-tree T is a tree with leaf set X and no vertices of degree two.
If every interior vertex of an X-tree T has degree three, then T is called a
trivalent tree. A split $A|B$ of X is a partition of the leaves into two non-empty
blocks, A and B. Removing an edge e from a phylogenetic X-tree T divides T into
two connected components, which induces a split of the leaf set X. We call this
the split associated with e. The collection of all the splits associated with the
edges of T is called the splits of T, denoted by $\mathcal{S}(T)$. Two splits
$A_1|B_1$ and $A_2|B_2$ of X are compatible if at least one of the four
intersections $A_1 \cap A_2$, $A_1 \cap B_2$, $B_1 \cap A_2$, and $B_1 \cap B_2$
is empty. A collection $\mathcal{S}$ of splits of X is compatible if it is
contained in the splits of some tree T ([2]). We adopt all of this notation
from [3].

Theorem 2.1 ([10]). A collection $\mathcal{S}$ of splits of X is pairwise
compatible if and only if there exists a tree T such that
$\mathcal{S} = \mathcal{S}(T)$.
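
The compatibility condition above is easy to check mechanically. The following is a minimal sketch, not part of the original paper; the helper names `split`, `compatible` and `pairwise_compatible` are illustrative assumptions, and a split is represented simply as a pair of frozensets.

```python
from itertools import combinations


def split(a, leaves):
    """Return the split A|B of `leaves` as a pair of frozensets (A, B)."""
    a = frozenset(a)
    return (a, frozenset(leaves) - a)


def compatible(s1, s2):
    """Two splits are compatible if one of the four intersections is empty."""
    (a1, b1), (a2, b2) = s1, s2
    return any(not (x & y) for x in (a1, b1) for y in (a2, b2))


def pairwise_compatible(splits):
    """Check the hypothesis of Theorem 2.1 for a collection of splits."""
    return all(compatible(s, t) for s, t in combinations(splits, 2))


leaves = {1, 2, 3, 4, 5}
s12 = split({1, 2}, leaves)   # 12|345
s45 = split({4, 5}, leaves)   # 45|123
s13 = split({1, 3}, leaves)   # 13|245
```

Here `s12` and `s45` can occur in the same tree, while `s12` and `s13` cannot, since all four of their intersections are non-empty.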

Let $X = [n] := \{1, \dots, n\}$ and let m be the number of states in the
alphabet $\Sigma$:

$m = 2$ for $\Sigma = \{0, 1\}$,
$m = 4$ for $\Sigma = \{A, C, G, T\}$.

Let $p_{i_1 \dots i_n}$ be the joint probability that leaf j is observed to be in
state $i_j$ for all $j \in \{1, \dots, n\}$. Write P for the entire probability
distribution.

Definition 2.2. A flattening along a partition $A|B$ is the $m^{|A|}$ by $m^{|B|}$
matrix where the rows are indexed by the possible states for the leaves in A and
the columns are indexed by the possible states for the leaves in B. The entries of
this matrix are given by the joint probabilities of observing the given pattern at
the leaves. We write $\mathrm{Flat}_{A|B}(P)$, or $F_{A|B}(P)$ for short, for this
matrix.
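
As an illustration of Definition 2.2, the sketch below builds an empirical flattening from a small alignment. It is an assumption of this sketch (not a statement from the paper) that the joint probabilities are estimated by site-pattern frequencies in an ungapped alignment; the function name `flattening` and the toy sequences are hypothetical, and the dense list-of-lists layout is only sensible for small |B|.

```python
from collections import Counter
from itertools import product


def flattening(seqs, A, B, alphabet="ACGT"):
    """Empirical flattening along A|B.

    seqs: dict mapping species -> aligned sequence (equal lengths, no gaps).
    Rows are indexed by the states of the leaves in A, columns by the states
    of the leaves in B; entries are observed site-pattern frequencies.
    """
    A, B = list(A), list(B)
    length = len(seqs[A[0]])
    counts = Counter(
        (tuple(seqs[s][c] for s in A), tuple(seqs[s][c] for s in B))
        for c in range(length)
    )
    rows = list(product(alphabet, repeat=len(A)))
    cols = list(product(alphabet, repeat=len(B)))
    return [[counts[(r, c)] / length for c in cols] for r in rows]


# Toy alignment of 4 species and 4 sites (made-up data).
toy_seqs = {1: "ACGT", 2: "ACGA", 3: "AAGT", 4: "ACGT"}
F = flattening(toy_seqs, A=[1, 2], B=[3, 4])
```

With m = 4 and |A| = |B| = 2 the result is a 16 by 16 matrix whose entries sum to 1.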

Next we define a measure of how close a general partition of the leaves is to a
split. If A is a subset of the leaves of T, let $T_A$ be the subtree induced by
the leaves in A. That is, $T_A$ is the minimal set of edges needed to connect the
leaves in A.

Definition 2.3. Suppose that $A|B$ is a partition of [n]. The distance between the
partition $A|B$ and the nearest split, written $e(A,B)$, is the number of edges
that occur in $T_A \cap T_B$.

Notice that $e(A,B) = 0$ exactly when $A|B$ is a split. Consider $T_A \cap T_B$ as
a subtree of $T_A$. Color the nodes in $T_A \cap T_B$ red and the nodes in
$T_A \setminus (T_A \cap T_B)$ blue. Say that a node is monochromatic if it and
all of its neighbors are of the same color. We let mono(A) be the number of
monochromatic red nodes.

Definition 2.4. Define mono(A) as the number of nodes in $T_A \cap T_B$ that do
not have a node in $T_A \setminus (T_A \cap T_B)$ as a neighbor.
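
To make Definition 2.3 concrete, the following sketch computes e(A, B) on a small tree. It is an illustration under stated assumptions: the tree is given as an adjacency dictionary, an edge is taken to lie in $T_A$ exactly when leaves of A sit on both of its sides, and the names `leaves_on_side`, `subtree_edges` and `e_distance` are hypothetical. The example tree is a trivalent quartet tree.

```python
def leaves_on_side(adj, start, blocked):
    """Leaves reachable from `start` without passing through `blocked`."""
    seen, stack, found = {start, blocked}, [start], set()
    while stack:
        u = stack.pop()
        if len(adj[u]) == 1:          # a node of degree one is a leaf
            found.add(u)
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return found


def subtree_edges(adj, A):
    """Edges of T_A: an edge is needed iff leaves of A lie on both its sides."""
    A = set(A)
    edges = set()
    for u in adj:
        for v in adj[u]:
            side = leaves_on_side(adj, u, v)
            if A & side and A - side:
                edges.add(frozenset((u, v)))
    return edges


def e_distance(adj, A, B):
    """Number of edges shared by T_A and T_B (Definition 2.3)."""
    return len(subtree_edges(adj, A) & subtree_edges(adj, B))


# Quartet tree with cherries (1, 2) and (3, 4); "u" and "v" are interior nodes.
quartet = {
    "1": ["u"], "2": ["u"], "3": ["v"], "4": ["v"],
    "u": ["1", "2", "v"], "v": ["u", "3", "4"],
}
```

On this tree the split 12|34 gives e = 0, while the non-split partition 13|24 shares the interior edge and gives e = 1.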

The following theorem relates how close a partition is to being a split to the
rank of the flattening associated with that partition. The theorem was originally
proved for the case where T is a trivalent tree. The same result holds for a
non-trivalent tree T, with a proof almost identical to the original one in [7].

Figure 1. Non-trivalent tree



Theorem 2.5. Let $A|B$ be a partition of [n], let T be an unrooted tree, not
necessarily trivalent, with leaves labeled by [n], and assume that the joint
probability distribution P comes from a Markov model on T with an alphabet with m
letters. Then the generic rank of the flattening $F_{A|B}(P)$ is given by

$$\min\left(m^{e(A,B)+1-\mathrm{mono}(A)},\; m^{e(A,B)+1-\mathrm{mono}(B)},\; m^{|A|},\; m^{|B|}\right).$$

Proof. Refer to [7], Theorem 19.5. □ 

Using Theorem 2.5 we have the following corollaries.

Corollary 2.6. If $A|B$ is a split in the tree, the generic rank of $F_{A|B}(P)$
is m.

Corollary 2.7. If $A|B$ is not a split in a trivalent tree and we have
$|A|, |B| \ge 2$, then the generic rank of $F_{A|B}(P)$ is at least $m^2$.

For non-trivalent trees the conclusion differs from that of Corollary 2.7: the
generic rank of $F_{A|B}(P)$ is only guaranteed to be at least m. To see this,
consider the 4-valent tree with $A = \{1, 2\}$ and $B = \{3, 4, 5, 6, 7, 8\}$. In
this case we have $R = \{v_1\}$, so that $e(A,B) + 1 - \mathrm{mono}(A) = 1$ and
$e(A,B) + 1 - \mathrm{mono}(B) = 1$ even though $A|B$ is not a split. Hence we get
$\mathrm{rank}\, F_{A|B}(P) = m$.
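
The rank formula of Theorem 2.5 can be evaluated directly once e(A, B), mono(A) and mono(B) are known. The sketch below is only a worked arithmetic aid; the function name `generic_rank` is hypothetical, and the values passed for the quartet cases (e = 0 for the split, e = 1 for the non-split, mono = 0 throughout) are computed here by hand from Definitions 2.3 and 2.4 rather than taken from the paper.

```python
def generic_rank(m, e, mono_a, mono_b, size_a, size_b):
    """Generic rank of the flattening F_{A|B}(P) as given by Theorem 2.5."""
    return min(
        m ** (e + 1 - mono_a),
        m ** (e + 1 - mono_b),
        m ** size_a,
        m ** size_b,
    )


# DNA alphabet, quartet tree, split 12|34: rank m.
rank_split = generic_rank(4, e=0, mono_a=0, mono_b=0, size_a=2, size_b=2)

# DNA alphabet, quartet tree, non-split 13|24: rank m^2.
rank_nonsplit = generic_rank(4, e=1, mono_a=0, mono_b=0, size_a=2, size_b=2)

# The 4-valent example from the text: both exponents equal 1, so rank m.
rank_4valent = generic_rank(4, e=0, mono_a=0, mono_b=0, size_a=2, size_b=6)
```

The last case shows why a non-split in a non-trivalent tree can be indistinguishable from a split by rank alone.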

A singular value decomposition of an $m \times n$ matrix A (with $m \ge n$) is a
factorization $A = U \Sigma V^T$, where U is $m \times n$ and satisfies
$U^T U = I$, V is $n \times n$ and satisfies $V^T V = I$, and
$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_n)$, where
$\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0$ are called the singular
values of A.

We need the following theorem to define the svd distance in the next section.

Theorem 2.8 ([6], Theorem 3.3). The distance from A to the nearest rank k matrix
in the Frobenius norm is

$$\min_{\mathrm{rank}(B)=k} \|A - B\|_F = \sqrt{\sum_{i=k+1}^{n} \sigma_i^2}.$$
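
Given the singular values, the distance of Theorem 2.8 is a one-line computation. The following minimal sketch assumes the singular values have already been obtained by some SVD routine; the function name `distance_to_rank` and the sample values are illustrative only.

```python
import math


def distance_to_rank(singular_values, k):
    """Frobenius distance to the nearest rank-k matrix (Theorem 2.8)."""
    s = sorted(singular_values, reverse=True)
    return math.sqrt(sum(x * x for x in s[k:]))


# Made-up singular values: dropping all but the two largest leaves 1 and 0.5.
d = distance_to_rank([3.0, 2.0, 1.0, 0.5], k=2)
```

Here `d` equals the square root of 1.25, and the distance is 0 whenever the matrix already has rank at most k.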



3. Algorithm for constructing a phylogenetic tree 



In this section, we present an algorithm for constructing a phylogenetic tree
using the SVD of flattenings, which improves on Eriksson's algorithm with respect
to numerical stability (cf. [7]). First we define a function called svd-cherry as
follows.

Definition 3.1. For each distinct pair $(s_i, s_j)$ in the k species
$S_k = \{s_1, s_2, \dots, s_k\}$, the svd distance $d_F(s_i, s_j, S_k)$ between
$s_i$ and $s_j$ in $S_k$ is defined as the distance from the flattening
$F_{\{s_i, s_j\} | \{s_1, \dots, s_k\} \setminus \{s_i, s_j\}}(P)$ to the nearest
rank m matrix in the Frobenius norm.

Definition 3.2. For given k species $S_k = \{s_1, s_2, \dots, s_k\}$, define

$$\text{svd-cherry}(s_1, s_2, \dots, s_k) := (s_{i^*}, s_{j^*}, v)$$

so that the pair $(s_{i^*}, s_{j^*})$ in $S_k$ and their svd distance v in $S_k$
satisfy

$$v := d_F(s_{i^*}, s_{j^*}, S_k) = \min\{\, d_F(s_i, s_j, S_k) \mid (i, j) \in [k] \times [k],\ i \ne j \,\}.$$
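
Definition 3.2 amounts to a minimum over unordered pairs. The sketch below illustrates this with a pluggable distance function; `svd_cherry`, `toy_distance` and the table of values are hypothetical stand-ins (the numbers are made up and are not results from the paper), and a real implementation would compute $d_F$ from a flattening.

```python
from itertools import combinations


def svd_cherry(species, svd_distance):
    """Return (s_i*, s_j*, v) minimizing the svd distance over distinct pairs."""
    best = min(
        combinations(species, 2),
        key=lambda pair: svd_distance(pair[0], pair[1], species),
    )
    return best[0], best[1], svd_distance(best[0], best[1], species)


# Made-up distances for four species, standing in for d_F.
toy_table = {
    frozenset((1, 2)): 0.03, frozenset((1, 3)): 0.20, frozenset((1, 4)): 0.21,
    frozenset((2, 3)): 0.20, frozenset((2, 4)): 0.20, frozenset((3, 4)): 0.01,
}


def toy_distance(si, sj, species):
    return toy_table[frozenset((si, sj))]


cherry = svd_cherry([1, 2, 3, 4], toy_distance)
```

Since the distance is symmetric in the two species, it suffices to scan unordered pairs rather than all of $[k] \times [k]$.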

Using Definition 3.2 we have the following algorithm.

Algorithm 1 (Building a phylogenetic tree using SVD of flattenings)
Input: A multiple alignment of genomic data from n species, over the alphabet
$\Sigma$ with m states.

Output: An unrooted phylogenetic tree T with n leaves labeled by the species.
Initialization: Partition the n species $s_1, s_2, \dots, s_n$ into n singletons
$C_{1,1}, C_{1,2}, \dots, C_{1,n}$.
Loop: For k from 1 to n - 3, perform the following steps.

Step 1: For each choice of n - k + 1 species $s_1, s_2, \dots, s_{n-k+1}$, where
$s_l \in C_{k,l}$ is a representative of $C_{k,l}$, $1 \le l \le n - k + 1$, find
the distinct pair of clusters $(C_{k,i^*}, C_{k,j^*})$ such that
svd-cherry$(s_1, s_2, \dots, s_{n-k+1}) = (s_{i^*}, s_{j^*}, v)$ with
$s_{i^*} \in C_{k,i^*}$, $s_{j^*} \in C_{k,j^*}$.

Step 2: Choose the pair of clusters $(C_{k,i^\circ}, C_{k,j^\circ})$ which occurs
most frequently in Step 1.

Step 3: Join $C_{k,i^\circ}$ and $C_{k,j^\circ}$ together in the tree and consider
this as a new cluster $C_{k+1,1}$. After that, rename the remaining $C_{k,l}$'s as
$C_{k+1,2}, \dots, C_{k+1,n-k}$.
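
The control flow of Algorithm 1 can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function `build_tree` and its return format (the list of joins plus the final three clusters) are assumptions, and `toy_distance` is a made-up stand-in for the svd distance that is 0 inside the groups {1,2}, {3,4}, {5,6} and 1 otherwise.

```python
from collections import Counter
from itertools import combinations, product


def build_tree(species, svd_distance):
    """Run the three steps of Algorithm 1; return (joins, final clusters)."""
    clusters = [frozenset([s]) for s in species]
    joins = []
    for _ in range(len(species) - 3):
        votes = Counter()
        # Step 1: each choice of one representative per cluster casts a vote.
        for reps in product(*clusters):
            si, sj = min(
                combinations(reps, 2),
                key=lambda p: svd_distance(p[0], p[1], reps),
            )
            ci = next(c for c in clusters if si in c)
            cj = next(c for c in clusters if sj in c)
            votes[frozenset((ci, cj))] += 1
        # Step 2: keep the most frequently chosen pair of clusters.
        pair, _count = votes.most_common(1)[0]
        ci, cj = pair
        # Step 3: join the pair into a new cluster and keep the rest.
        joins.append((ci, cj))
        clusters = [ci | cj] + [c for c in clusters if c not in (ci, cj)]
    return joins, clusters


def toy_distance(si, sj, reps):
    """Hypothetical stand-in for d_F: 0 within a group, 1 across groups."""
    return 0.0 if (si - 1) // 2 == (sj - 1) // 2 else 1.0


joins, final_clusters = build_tree([1, 2, 3, 4, 5, 6], toy_distance)
```

The loop over `product(*clusters)` is what makes the number of SVD computations grow with the product of the cluster sizes; each iteration is independent of the others, which is why the method parallelizes well.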



Proposition 3.3. Algorithm 1 needs to compute an SVD at most
$\sum_{k=4}^{n} \lfloor (n/k)^k \rfloor \binom{k}{2}$ times. Here
$\lfloor (n/k)^k \rfloor$ is the largest integer not greater than $(n/k)^k$.

Proof. If we have k clusters $C_1, C_2, \dots, C_k$, then the number of possible
choices of one representative from each cluster is $\prod_i |C_i|$, where $|C_i|$
is the cardinality of $C_i$. Since $\sum_i |C_i| = n$, the product
$\prod_i |C_i|$ is maximal when $|C_1| = |C_2| = \dots = |C_k| = n/k$. Thus we
compute an SVD at most $\sum_{k=4}^{n} \lfloor (n/k)^k \rfloor \binom{k}{2}$ times
in total, for flattenings with a fixed number of rows, namely $m^2$. □
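
The bound of Proposition 3.3 is easy to tabulate. The sketch below evaluates it with exact integer arithmetic, reading the sum as running over the number of clusters k from 4 to n; the function name `svd_count_bound` is hypothetical.

```python
from math import comb


def svd_count_bound(n):
    """Upper bound on the number of SVDs: sum of floor((n/k)^k) * C(k, 2)."""
    return sum((n ** k // k ** k) * comb(k, 2) for k in range(4, n + 1))


bound_6 = svd_count_bound(6)
```

For n = 6 the three terms are 5 * 6, 2 * 10 and 1 * 15, so the bound is 65.
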
Each cluster $C_{k,l}$, $1 \le k \le n - 2$, $1 \le l \le n - k + 1$, in
Algorithm 1 corresponds to a split in the tree. In Algorithm 1 we have the
following hierarchy of $C_{k,l}$'s. In the Initialization, the $C_{1,l}$
($1 \le l \le n$) correspond to the n trivial splits, in other words, the outer
edges of the tree T. At the end of the first loop, there is one new cluster among
the $C_{2,l}$ ($1 \le l \le n - 1$), which corresponds to one new split in T. At
the end of each k-th loop, from k = 1 up to k = n - 3, we obtain one new edge in
T. In total we have exactly 2n - 3 splits in T.

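$$\begin{array}{lllllll}
k = 1: & C_{1,1} & C_{1,2} & C_{1,3} & \cdots & C_{1,n-2} \quad C_{1,n-1} \quad C_{1,n} \\
k = 2: & C_{2,1} & C_{2,2} & C_{2,3} & \cdots & C_{2,n-2} \quad C_{2,n-1} \\
k = 3: & C_{3,1} & C_{3,2} & C_{3,3} & \cdots & C_{3,n-2} \\
\vdots & & & & & \\
k = n-3: & C_{n-3,1} & C_{n-3,2} & C_{n-3,3} & C_{n-3,4} & \\
k = n-2: & C_{n-2,1} & C_{n-2,2} & C_{n-2,3} & &
\end{array}$$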

The flattenings have size $m^2 \times m^{k-2}$, which may be large as k varies
from n down to 4; on the other hand, flattenings are very sparse. Thus, for every
flattening A, it is faster to compute the eigenvalues of $A A^T$, of fixed size
$m^2 \times m^2$, than the singular values of A itself, of size
$m^2 \times m^{k-2}$. Eriksson computed singular values of flattenings of various
huge sizes $m^{|A|} \times m^{|B|}$, where $A|B$ is a partition of [n], which is
liable to cause numerical instability. We avoid these computational difficulties,
since we only deal with $A A^T$ of fixed size $m^2 \times m^2$ for every
flattening A. Although the number of SVDs of matrices of fixed size
$m^2 \times m^2$ becomes large as the number of species grows, Algorithm 1 is
well suited to parallel computing, and especially to grid computing, which
marshals many volunteer computing resources for distributed computation.
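
The idea of working with the small Gram matrix can be sketched as follows. The squared singular values of a flattening A are the eigenvalues of $A A^T$, so the distance of Theorem 2.8 is the square root of the sum of all but the k largest eigenvalues. This is an illustration only: the cyclic Jacobi eigenvalue routine is a technique chosen here for a dependency-free example (the paper does not say which eigenvalue solver was used), and all function names are hypothetical.

```python
import math


def gram(rows):
    """Return A A^T for a matrix A given as a list of rows."""
    return [[sum(x * y for x, y in zip(r, s)) for s in rows] for r in rows]


def jacobi_eigenvalues(S, sweeps=50, tol=1e-24):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in S]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                # Rotation angle (at most pi/4 in size) that zeroes a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):      # apply the rotation to columns p, q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):      # apply the rotation to rows p, q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def distance_to_rank_via_gram(rows, k):
    """Distance to the nearest rank-k matrix, using eigenvalues of A A^T."""
    eig = jacobi_eigenvalues(gram(rows))
    return math.sqrt(sum(max(x, 0.0) for x in eig[k:]))


# Made-up 2 x 3 matrix: A A^T = [[4, 2], [2, 2]] has eigenvalues 3 +/- sqrt(5).
dist = distance_to_rank_via_gram([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]], 1)
```

For DNA data the Gram matrix is 16 by 16 regardless of the number of species, so the eigenvalue step has a fixed cost; only forming $A A^T$ depends on the number of columns.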

Theorem 3.4. Algorithm 1 is statistically consistent. 

Proof. By Corollary 2.6, $d_F(s_i, s_j, S_k)$ goes to 0 if
$\{s_i, s_j\} | \{s_1, \dots, s_k\} \setminus \{s_i, s_j\}$ is a true split of
$S_k = \{s_1, \dots, s_k\}$, while Corollary 2.7 shows that $d_F(s_i, s_j, S_k)$
does not go to 0 if $\{s_i, s_j\} | \{s_1, \dots, s_k\} \setminus \{s_i, s_j\}$ is
a partition which is not a split of $S_k$. Hence, as the empirical distribution
approaches the true one, the distance from rank m of a split goes to zero, while
the distance from rank m of a non-split does not. Therefore Algorithm 1 picks a
correct split at each loop. □

Example

We begin with an alignment of DNA data of length 1000 for 6 species, labeled
1, ..., 6, simulated from the tree in Figure 2 with all branch lengths equal to
0.1. For the loop (k = 1), let $C_{1,1} = \{1\}$, $C_{1,2} = \{2\}$,
$C_{1,3} = \{3\}$, $C_{1,4} = \{4\}$, $C_{1,5} = \{5\}$, $C_{1,6} = \{6\}$ and
consider all pairs of the 6 species. In the following results,
svd-val($s_1, s_2 | s_3, \dots, s_k$) is the distance from the flattening
$F_{\{s_1, s_2\} | \{s_3, \dots, s_k\}}(P)$ to the nearest rank 4 matrix in the
Frobenius norm, as in Theorem 2.8.

Figure 2. Model tree for 6 species.

After the first loop, since svd-cherry(1, 2, 3, 4, 5, 6) = (5, 6, 0.0076), we have
a new cluster $C_{2,1} = \{5, 6\}$ and rename $C_{2,2} = \{1\}$,
$C_{2,3} = \{2\}$, $C_{2,4} = \{3\}$, $C_{2,5} = \{4\}$.



loop: k = 2
svd-val( 2, 4 | 1, 3, 5 ) = 0.2095
svd-val( 2, 5 | 1, 3, 4 ) = 0.0290 
svd-val( 3, 4 | 1, 2, 5 ) = 0.0127 
svd-val( 3, 5 | 1, 2, 4 ) = 0.2101 
svd-val( 4, 5 | 1, 2, 3 ) = 0.2099 
min value cherry = (3, 4 )

For the loop (k = 2), first take 5 as a representative of $C_{2,1} = \{5, 6\}$ and
get svd-cherry(1, 2, 3, 4, 5) = (3, 4, 0.0127). Next choose 6 as a representative
of $C_{2,1} = \{5, 6\}$ and get svd-cherry(1, 2, 3, 4, 6) = (3, 4, 0.0127). The
most frequent pair of clusters among $C_{2,1}, \dots, C_{2,5}$ is
$(C_{2,4}, C_{2,5})$. We obtain a new cluster $C_{3,1} = \{3, 4\}$ and rename
$C_{3,2} = \{5, 6\}$, $C_{3,3} = \{1\}$, $C_{3,4} = \{2\}$.

loop: k = 3
svd-val( 1, 2 | 5, 3 ) = 0.0213
svd-val( 1, 5 | 2, 3 ) = 0.0208
svd-val( 1, 3 | 2, 5 ) = 0.0285
min value cherry = ( 1, 5 )

svd-val( 1, 2 | 6, 3 ) = 0.0167
svd-val( 1, 6 | 2, 3 ) = 0.0207
svd-val( 1, 3 | 2, 6 ) = 0.0252
min value cherry = ( 1, 2 )

svd-val( 1, 2 | 5, 4 ) = 0.0214
svd-val( 1, 5 | 2, 4 ) = 0.0221
svd-val( 1, 4 | 2, 5 ) = 0.0295
min value cherry = ( 1, 2 )

svd-val( 1, 2 | 6, 4 ) = 0.0168
svd-val( 1, 6 | 2, 4 ) = 0.0219
svd-val( 1, 4 | 2, 6 ) = 0.0262
min value cherry = ( 1, 2 )

For the loop (k = 3), first take 3 as a representative of $C_{3,1} = \{3, 4\}$ and
5 as a representative of $C_{3,2} = \{5, 6\}$; then we get
svd-cherry(3, 5, 1, 2) = (1, 5, 0.0208). Next choose 3 as a representative of
$C_{3,1} = \{3, 4\}$ and 6 as a representative of $C_{3,2} = \{5, 6\}$, and get
svd-cherry(3, 6, 1, 2) = (1, 2, 0.0167). In the same manner we have
svd-cherry(5, 4, 1, 2) = (1, 2, 0.0214) and svd-cherry(6, 4, 1, 2) =
(1, 2, 0.0168). The most frequent pair of clusters among
$C_{3,1}, \dots, C_{3,4}$ is $(C_{3,3}, C_{3,4})$. We obtain a new cluster
$C_{4,1} = \{1, 2\}$ and rename $C_{4,2} = \{5, 6\}$, $C_{4,3} = \{3, 4\}$. We can
join these three clusters $C_{4,1}, C_{4,2}, C_{4,3}$ to make an unrooted tree.
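
The vote in the loop k = 3 can be replayed from the four svd-cherry results reported above. In this small sketch the tuples are the values from the example; the dictionary `cluster_of` and the string labels for the clusters are illustrative conventions.

```python
from collections import Counter

# svd-cherry results for the four choices of representatives in loop k = 3.
results = [(1, 5, 0.0208), (1, 2, 0.0167), (1, 2, 0.0214), (1, 2, 0.0168)]

# Cluster membership at the start of loop k = 3.
cluster_of = {
    3: "C_{3,1}", 4: "C_{3,1}",
    5: "C_{3,2}", 6: "C_{3,2}",
    1: "C_{3,3}", 2: "C_{3,4}",
}

votes = Counter(
    frozenset((cluster_of[i], cluster_of[j])) for i, j, _ in results
)
winner, count = votes.most_common(1)[0]
```

The pair $(C_{3,3}, C_{3,4})$ wins with three of the four votes, even though one choice of representatives favored the pair (1, 5).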

4. Simplified tree constructing algorithm 

In Algorithm 1, if we can reduce the number of feasible representatives of each
cluster $C_{k,j}$ using some available a priori information, then computational
cost can be saved. In this section, as an example, we choose as the unique
representative of each cluster the species which has the smallest distance from
the species outside the cluster.

Algorithm 2 (Simplified tree constructing algorithm)

Input: A multiple alignment of genomic data from n species, over the alphabet
$\Sigma$ with m states.

Output: An unrooted phylogenetic tree T with n leaves labeled by the species.
Initialization: Partition the n species $s_1, s_2, \dots, s_n$ into n singletons
$C_{1,1}, C_{1,2}, \dots, C_{1,n}$.
Loop: For k from 1 to n - 3, perform the following steps.

Step 1: For each cluster $C_{k,j}$, $1 \le j \le n - k + 1$, choose the
representative $s_j^r \in C_{k,j}$ as follows.
For all $s \in C_{k,j}$, calculate $v_s = \sum_{s' \in [n] \setminus C_{k,j}} p(s, s')$,
where $p(s, s')$ is the proportion of different nucleotides between the two
species $s, s'$. Choose $s_j^r \in C_{k,j}$ which has the smallest value $v_s$
among all $s \in C_{k,j}$.

Step 2: For $(s_1^r, s_2^r, \dots, s_{n-k+1}^r)$, where $s_l^r \in C_{k,l}$,
$1 \le l \le n - k + 1$, find the distinct pair of clusters
$(C_{k,i^*}, C_{k,j^*})$ such that
svd-cherry$(s_1^r, s_2^r, \dots, s_{n-k+1}^r) = (s_{i^*}, s_{j^*}, v)$ with
$s_{i^*} \in C_{k,i^*}$, $s_{j^*} \in C_{k,j^*}$.

Step 3: Join $C_{k,i^*}$ and $C_{k,j^*}$ together in the tree and consider this as
a new cluster $C_{k+1,1}$. After that, rename the remaining $C_{k,l}$'s as
$C_{k+1,2}, \dots, C_{k+1,n-k}$.

Figure 3. Model trees (A) (left) and (B) (right).



Note that Algorithm 2 constructs phylogenetic trees with many leaves much faster,
since it uses only one representative for each cluster $C_{k,l}$.
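
Step 1 of Algorithm 2 can be sketched as follows. This is a minimal illustration under the assumption of an ungapped alignment stored as a dictionary of equal-length strings; the function names `p_distance` and `choose_representative` and the toy sequences are hypothetical, and ties are broken here by taking the smallest label, a choice the paper does not specify.

```python
def p_distance(x, y):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


def choose_representative(cluster, seqs):
    """Pick the member of `cluster` with the smallest total p-distance
    to all species outside the cluster (the value v_s of Step 1)."""
    outside = [s for s in seqs if s not in cluster]

    def v(s):
        return sum(p_distance(seqs[s], seqs[t]) for t in outside)

    return min(sorted(cluster), key=v)


# Made-up alignment: species 2 is closer than species 1 to species 3 and 4.
toy_seqs = {1: "AAAA", 2: "AAAT", 3: "AATT", 4: "TTTT"}
rep = choose_representative({1, 2}, toy_seqs)
```

In this toy example $v_1 = 0.5 + 1.0 = 1.5$ and $v_2 = 0.25 + 0.75 = 1.0$, so species 2 represents the cluster {1, 2}.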



5. Performance analysis of tree constructing algorithms 

5.1. Building phylogenetic trees with simulated data. We chose the phylogenetic
tree models in Figure 3 and simulated DNA sequence data on these trees using the
program seq-gen ([12]). Figure 3 shows the variables a, b, c in the trees. These
trees were chosen as difficult trees in [13]. Next, we built trees from these data
using Algorithms 1 and 2 and the neighbor joining algorithm with Jukes-Cantor
distance. For each algorithm, we plotted the percentage of trees correctly
reconstructed among 1000 DNA data sets, for various sequence lengths.

Figure 4 shows the results for the case a = .01, b = .04, c = .07 for both model
trees in Figure 3. The results for the case a = .02, b = .13, c = .19 are shown
in Figure 5. Algorithm 1 performs better than Algorithm 2, but worse than the
neighbor joining algorithm. This might be expected, because the DNA data used
were simulated by a distance-based algorithm.

We also tested Algorithms 1 and 2 on the reconstruction of a tree with many
species, namely 32. We simulated 100 DNA data sets of length 1000 for 32 species
from the tree in Figure 6 with all branch lengths equal to .1. Algorithm 2
achieved a reconstruction rate of 94 %, whereas the neighbor joining algorithm
achieved 99 %. For Algorithm 1, we tested only 1 data set, using a parallel
cluster machine with 10 Athlon 2600 CPUs; it took about 3 hours. The loop that
took the longest time was loop 19, in which Algorithm 1 performed
$2^{16} \binom{14}{2}$ SVD computations. The important point is that we change the
type of difficulty in rebuilding a tree of many species from time and numerical
instability to time only. Furthermore, the difficulty in time can be overcome in
various ways.

Figure 4. Simulation results of tree construction methods for Model
trees A (left) and B (right) in the case a = .01, b = .04, c = .07.




Figure 5. Simulation results of tree construction methods for Model 
trees A(left) and B(right) in the case a = .02, b = .13, c = .19. 

5.2. Building phylogenetic trees with real data. For data, we use the September
2005 freeze of the ENCODE alignments. We restrict our attention to the problem of
constructing a phylogenetic tree for 8 species: human, chimp, galago, mouse, rat,
cow, dog, and chicken, which is known as the rodent problem. We processed each of
the 44 ENCODE regions to obtain data sets consisting of ungapped columns, each
more than 100 bps in length. We obtained 75 data sets from 14 manually chosen Enm
regions and 301 data sets from all 44 ENCODE regions.

Recall that the Robinson-Foulds metric, also called the partition metric,
proposed in [11], is one of the simplest metrics on trees. The distance between
two trees $T_1$ and $T_2$ is defined by

$$d_{RF}(T_1, T_2) = \frac{1}{2}\left(|\mathcal{S}(T_1) - \mathcal{S}(T_2)| + |\mathcal{S}(T_2) - \mathcal{S}(T_1)|\right).$$

Here $\mathcal{S}(T)$ is the set of splits in T and
$|\mathcal{S}(T_i) - \mathcal{S}(T_j)|$ is the cardinality of the set
$\mathcal{S}(T_i) - \mathcal{S}(T_j)$. The symmetric distance $d_s(T_1, T_2)$ is
twice $d_{RF}(T_1, T_2)$.

Figure 6. Phylogenetic tree with 32 leaves, all of whose edges
have branch length .1.
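
The Robinson-Foulds and symmetric distances defined above can be computed directly from split sets. In the following minimal sketch each tree is represented by its set of splits; the helper `normalize`, which stores a split as an unordered pair of blocks so that A|B and B|A coincide, and the example trees are illustrative assumptions.

```python
def normalize(block, leaves):
    """Represent the split block|complement as an unordered pair of blocks."""
    a = frozenset(block)
    return frozenset((a, frozenset(leaves) - a))


def robinson_foulds(splits1, splits2):
    """d_RF: half the size of the symmetric difference of the split sets."""
    s1, s2 = set(splits1), set(splits2)
    return (len(s1 - s2) + len(s2 - s1)) / 2


def symmetric_distance(splits1, splits2):
    """d_s: twice the Robinson-Foulds distance."""
    return 2 * robinson_foulds(splits1, splits2)


leaves = {1, 2, 3, 4, 5}
trivial = [normalize({x}, leaves) for x in leaves]
# Two trees on 5 leaves sharing the split 12|345 but differing in the other.
tree1 = trivial + [normalize({1, 2}, leaves), normalize({4, 5}, leaves)]
tree2 = trivial + [normalize({1, 2}, leaves), normalize({3, 4}, leaves)]
```

The two example trees differ in exactly one split each, so $d_{RF} = 1$ and $d_s = 2$.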

Rodents have very different morphological features from the primates, although
their molecular data are similar. For this reason many biologists pay attention
to rodents. Tree construction algorithms using genomic data usually misplace the
rodents, mouse and rat, on the tree with respect to the other mammals. Fossil
records and molecular data give a tree that is regarded as biologically correct,
although it is not certain that it is. In this tree we have the primate clade of
human and chimpanzee, with the galago as an outgroup to these two. The rodent
clade (mouse and rat) is a sister group to the clade (human, chimpanzee and
galago), and the clade (dog and cow) is the outgroup to the former 5 species. The
chicken is an outgroup to all of these, as the root of this phylogenetic tree. On
the other hand, the usual tree reconstruction algorithms misplace the rodents,
locating them as an outgroup to the clade (human, chimpanzee and galago) (see
Figure 7).

The reasons for this are not entirely known, but it may be because tree
construction methods generally assume the existence of a global rate matrix for
all the species, whereas rat and mouse have mutated faster than the other species.
Our algorithms do not assume anything about the rate matrix ([10], Chapter 21).

In fact, Table 1 shows that our algorithms perform quite well on the ENCODE data
sets compared to the NJ algorithm (neighbor joining with Jukes-Cantor distance).
Algorithm 1 reconstructs the correct tree about as often as NJ (cf. [7], p. 357),
but has a smaller symmetric distance $d_s$ on average than NJ.

Figure 7. Biologically correct tree (left) and the tree obtained by the usual
algorithms (right).

method        Algorithm 1    Algorithm 2    NJ
All (P_c)     11.9           10.6           11.6
All (d_s)     2.57           2.85           2.77
Enm (P_c)     12.0           12.0           14.6
Enm (d_s)     2.14           2.52           2.45

Table 1. Comparison of the algorithms on data from the ENCODE project. P_c is the
percentage of trees reconstructed and d_s is the symmetric distance.